• 1 Homework Header
  • 2 Problem 2.1 (Data Manipulation):
    • 2.1 Student response
    • 2.2 Summary for ALS
    • 2.3 Summary for knee_df as a total and per view
  • 3 Problem 2.2 (Bivariate Relations Plots):
    • 3.1 Student response
    • 3.2 Student response
  • 4 Problem 2.3 (Missing Data)
    • 4.1 Student response
  • 5 Problem 2.4 (Surface Plots)
    • 5.1 Student response
  • 6 Problem 2.5 (Sample-Size Rebalancing)
    • 6.1 Student response

1 Homework Header

HW 2

Fall 2020, DSPA (HS650)

Name: Danny Siu

SID: #### - 9281 (last 4 digits only)

UMich E-mail:

I certify that the following paper represents my own independent work and conforms with the guidelines of academic honesty described in the UMich student handbook.

Remember you are allowed and encouraged to discuss, on a conceptual level, the problems with your classmates; however, this cannot involve the exchange of actual code, printouts, solutions, e-mails or other explicit electronic or paper handouts.

2 Problem 2.1 (Data Manipulation):

Load the following two datasets separately, generate summary statistics for all features, plot some of the features using histograms, box plots, density plots, etc., as appropriate, and save the summaries locally as Text files.

2.1 Student response

I loaded the data from external download links so that the work is reproducible. There are 101 features in ALS_train but 131 features in ALS_test. The knee pain dataset contains x and Y coordinates for 4 types of views, stored in long format.

##   ID Age_mean Albumin_max Albumin_median Albumin_min Albumin_range ALSFRS_slope
## 1  1       65          57           40.5          38   0.066202091   -0.9656085
## 2  2       48          45           41.0          39   0.010452962   -0.9217172
## 3  3       38          50           47.0          45   0.008928571   -0.9147870
## 4  4       63          47           44.0          41   0.012111135   -0.5983607
## 5  5       63          47           45.5          42   0.008291874   -0.4440389
## 6  6       36          51           47.0          46   0.009057971   -0.1183528
##   ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range
## 1               30                28.0               22         0.02116402
## 2               37                33.0               21         0.02872531
## 3               24                14.0               10         0.02500000
## 4               30                29.0               24         0.01496259
## 5               32                27.5               20         0.02037351
## 6               37                34.5               27         0.01811594
##   ALT.SGPT._max ALT.SGPT._median ALT.SGPT._min ALT.SGPT._range AST.SGOT._max
## 1            24             22.0            18      0.02090592            31
## 2            25             13.0             8      0.02961672            31
## 3            25             20.0            14      0.01964286            24
## 4            62             60.0            41      0.05236908            46
## 5            38             26.5            22      0.02653400            35
## 6            34             23.0            18      0.02898551            31
##   AST.SGOT._median AST.SGOT._min AST.SGOT._range Bicarbonate_max
## 1             27.5            23      0.02787456              30
## 2             17.0            14      0.02961672              32
## 3             19.0            18      0.01071429              35
## 4             40.0            33      0.03241895              23
## 5             26.5            20      0.02487562              32
## 6             26.0            21      0.01811594              29
##   Bicarbonate_median Bicarbonate_min Bicarbonate_range
## 1                 28              25       0.017421603
## 2                 28              25       0.012195122
## 3                 29              24       0.019642857
## 4                 20              20       0.007481297
## 5                 28              23       0.014925373
## 6                 26              22       0.012681159
##   Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
## 1                        8.0322                          7.11945
## 2                        8.3973                          4.74630
## 3                        5.4765                          4.38120
## 4                        8.0322                          8.03220
## 5                        5.1114                          4.19865
## 6                        6.5718                          5.11140
##   Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range
## 1                        6.5718                     0.005088502
## 2                        4.0161                     0.007632753
## 3                        3.6510                     0.003259821
## 4                        6.5718                     0.003641895
## 5                        3.6510                     0.002421891
## 6                        4.0161                     0.004629891
##   bp_diastolic_max bp_diastolic_median bp_diastolic_min bp_diastolic_range
## 1               90                  83               69         0.05555556
## 2               80                  78               64         0.02872531
## 3               86                  76               58         0.05000000
## 4               90                  80               70         0.04987531
## 5              100                  80               68         0.05306799
## 6               84                  80               60         0.04347826
##   bp_systolic_max bp_systolic_median bp_systolic_min bp_systolic_range
## 1             160              139.0             129        0.08201058
## 2             140              132.5             104        0.06463196
## 3             120              110.0              90        0.05357143
## 4             150              130.0             120        0.07481297
## 5             160              130.0             104        0.09286899
## 6             140              115.0             100        0.07246377
##   Calcium_max Calcium_median Calcium_min Calcium_range Chloride_max
## 1     2.49500       2.220550     2.22055   0.000956272          109
## 2     2.32035       2.170650     2.02095   0.000521603          108
## 3     2.47005       2.295400     2.19560   0.000490089          108
## 4     2.47005       2.345300     2.23000   0.000473934          109
## 5     2.42015       2.257975     2.17065   0.000413765          107
## 6     2.39520       2.270450     2.17065   0.000406793          110
##   Chloride_median Chloride_min Chloride_range Creatinine_max Creatinine_median
## 1             108          103    0.020905923          79.56             79.56
## 2             102          100    0.013937282          61.88             53.04
## 3             106          104    0.007142857          88.40             79.56
## 4             107          106    0.007481297          70.72             61.88
## 5             104          100    0.011608624          61.88             48.62
## 6             105          101    0.016304348         106.08             88.40
##   Creatinine_min Creatinine_range Gender_mean Glucose_max Glucose_median
## 1          70.72       0.03080139           1      7.4370         4.4955
## 2          44.20       0.03080139           1      6.7710         4.9950
## 3          70.72       0.03157143           2      5.6610         5.1060
## 4          53.04       0.04408978           2      5.1060         4.7730
## 5          26.52       0.05864013           1      7.4925         5.7165
## 6          70.72       0.06405797           2      5.5500         5.1060
##   Glucose_min Glucose_range hands_max hands_median hands_min hands_range
## 1      4.2180   0.011216028         8          7.5         6 0.005291005
## 2      4.0515   0.004737805         8          6.0         6 0.003590664
## 3      4.2180   0.002576786         4          1.0         0 0.007142857
## 4      4.6620   0.001107232         6          5.5         4 0.004987531
## 5      5.0505   0.004049751         8          6.5         3 0.008488964
## 6      4.4400   0.002010870         8          7.0         5 0.005434783
##   Hematocrit_max Hematocrit_median Hematocrit_min Hematocrit_range
## 1           44.6             43.15           40.7      0.013588850
## 2           41.9             39.60           37.7      0.007317073
## 3           49.1             46.20           44.0      0.009107143
## 4           46.3             43.00           41.7      0.011471322
## 5           44.0             42.85           39.5      0.007462687
## 6           46.8             43.50           41.9      0.008876812
##   Hemoglobin_max Hemoglobin_median Hemoglobin_min Hemoglobin_range leg_max
## 1            156             146.0            143       0.04529617       8
## 2            138             132.0            128       0.01742160       8
## 3            161             154.0            151       0.01785714       4
## 4            154             145.0            144       0.02493766       4
## 5            152             146.5            138       0.02321725       2
## 6            157             146.0            142       0.02717391       8
##   leg_median leg_min   leg_range mouth_max mouth_median mouth_min mouth_range
## 1        6.5       4 0.010582011         5          3.5         0 0.013227513
## 2        7.5       3 0.008976661         9          8.0         4 0.008976661
## 3        3.0       2 0.003571429        10          7.0         4 0.010714286
## 4        3.5       2 0.004987531        12         12.0        12 0.000000000
## 5        2.0       0 0.003395586        12         12.0        12 0.000000000
## 6        8.0       4 0.007246377         9          8.0         7 0.003623188
##   onset_delta_mean onset_site_mean Platelets_max Platelets_median Platelets_min
## 1            -1023               1           172            169.0           152
## 2             -341               1           286            264.0           230
## 3            -1181               1           233            213.0           167
## 4             -365               2           275            233.0           204
## 5            -1768               2           313            283.5           268
## 6             -334               1           220            194.0           178
##   Potassium_max Potassium_median Potassium_min Potassium_range pulse_max
## 1           4.5             4.25           4.0     0.001742160        79
## 2           5.0             4.30           3.9     0.001916376        90
## 3           4.1             4.00           3.9     0.000357143        82
## 4           4.3             4.20           4.0     0.000748130        84
## 5           4.6             3.75           3.5     0.001824212       101
## 6           4.5             4.30           4.2     0.000543478        88
##   pulse_median pulse_min pulse_range respiratory_max respiratory_median
## 1           68        61  0.04761905               4                  3
## 2           76        64  0.04667864               4                  4
## 3           73        60  0.03928571               4                  4
## 4           72        68  0.03990025               3                  3
## 5           96        74  0.04477612               4                  4
## 6           66        60  0.05072464               4                  4
##   respiratory_min respiratory_range Sodium_max Sodium_median Sodium_min
## 1               3       0.002645503        148         145.5        143
## 2               3       0.001795332        142         138.0        136
## 3               4       0.000000000        145         143.0        140
## 4               3       0.000000000        143         139.0        138
## 5               3       0.001697793        143         140.0        138
## 6               3       0.001811594        145         141.0        137
##   Sodium_range SubjectID trunk_max trunk_median trunk_min trunk_range
## 1  0.017421603       533         8            7         7 0.002645503
## 2  0.010452962       649         8            7         5 0.005385996
## 3  0.008928571      1234         5            0         0 0.008928571
## 4  0.012468828      2492         5            5         3 0.004987531
## 5  0.008291874      2956         6            4         1 0.008488964
## 6  0.014492754      3085         8            8         7 0.001811594
##   Urine.Ph_max Urine.Ph_median Urine.Ph_min
## 1            6               6            6
## 2            7               5            5
## 3            6               5            5
## 4            7               6            5
## 5            6               5            5
## 6            8               6            5
##    x   Y View
## 1 11  73   RF
## 2 20  88   RF
## 3 19  73   RF
## 4 15  65   RF
## 5 21  57   RF
## 6 26 101   RF

Next, I want to compute summary statistics for all of the features. First, I notice that the test dataset for ALS has more features than the training dataset. Let’s find out which columns differ.

##  [1] "Basophils_max"                 "Basophils_median"             
##  [3] "Basophils_min"                 "Basophils_range"              
##  [5] "Bilirubin..total._max"         "Bilirubin..total._median"     
##  [7] "Bilirubin..total._min"         "Bilirubin..total._range"      
##  [9] "BMI_max"                       "Eosinophils_max"              
## [11] "Eosinophils_median"            "Eosinophils_min"              
## [13] "Eosinophils_range"             "Lymphocytes_max"              
## [15] "Lymphocytes_median"            "Lymphocytes_min"              
## [17] "Lymphocytes_range"             "Monocytes_max"                
## [19] "Monocytes_median"              "Monocytes_min"                
## [21] "Monocytes_range"               "Red.Blood.Cells..RBC._max"    
## [23] "Red.Blood.Cells..RBC._median"  "Red.Blood.Cells..RBC._min"    
## [25] "Red.Blood.Cells..RBC._range"   "Urine.Ph_range"               
## [27] "White.Blood.Cell..WBC._max"    "White.Blood.Cell..WBC._median"
## [29] "White.Blood.Cell..WBC._min"    "White.Blood.Cell..WBC._range"
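A column comparison like the one above can be sketched with `setdiff()` on the column names. The two small data frames below are hypothetical stand-ins for ALS_train and ALS_test, not the real data.

```r
# Toy stand-ins for ALS_train and ALS_test (hypothetical values).
train_toy <- data.frame(ID = 1:3, Age_mean = c(65, 48, 38))
test_toy  <- data.frame(ID = 1:3, Age_mean = c(51, 60, 44),
                        BMI_max = c(24.1, 27.3, 22.8))

# Columns present in the test set but absent from the training set.
extra_in_test <- setdiff(names(test_toy), names(train_toy))

# Columns present in the training set but absent from the test set.
extra_in_train <- setdiff(names(train_toy), names(test_toy))

print(extra_in_test)
print(extra_in_train)
```

Checking both directions is worthwhile because `setdiff()` is not symmetric.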

2.2 Summary for ALS

##                                      mean      sd   median
## ID                               1214.875 696.678 1213.000
## Age_mean                           54.550  11.397   55.000
## Albumin_max                        47.011   3.234   47.000
## Albumin_median                     43.953   2.655   44.000
## Albumin_min                        40.766   3.193   41.000
## Albumin_range                       0.014   0.010    0.012
## ALSFRS_slope                       -0.728   0.622   -0.621
## ALSFRS_Total_max                   31.692   5.314   33.000
## ALSFRS_Total_median                27.105   6.634   28.000
## ALSFRS_Total_min                   19.877   8.584   20.000
## ALSFRS_Total_range                  0.026   0.016    0.023
## ALT.SGPT._max                      54.436  44.830   45.000
## ALT.SGPT._median                   32.993  15.602   30.000
## ALT.SGPT._min                      23.015  11.231   21.000
## ALT.SGPT._range                     0.071   0.111    0.048
## AST.SGOT._max                      43.128  35.289   38.000
## AST.SGOT._median                   29.077   9.594   27.000
## AST.SGOT._min                      21.542   7.395   20.000
## AST.SGOT._range                     0.049   0.084    0.035
## Bicarbonate_max                    30.897   3.164   31.000
## Bicarbonate_median                 26.964   2.199   27.000
## Bicarbonate_min                    23.164   2.409   23.000
## Bicarbonate_range                   0.017   0.011    0.015
## Blood.Urea.Nitrogen..BUN._max       7.353   2.320    6.937
## Blood.Urea.Nitrogen..BUN._median    5.558   1.335    5.423
## Blood.Urea.Nitrogen..BUN._min       4.161   1.354    4.070
## Blood.Urea.Nitrogen..BUN._range     0.007   0.005    0.006
## bp_diastolic_max                   92.031   8.758   90.000
## bp_diastolic_median                81.113   7.246   80.000
## bp_diastolic_min                   69.891   8.444   70.000
## [1] 0
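A table of mean, standard deviation, and median per feature, saved as a text file, could be produced roughly as follows. This is a minimal sketch in base R; the data frame `als_toy` reuses a few values from the ALS_train preview above, and the output path is a temporary file chosen only for illustration.

```r
# Toy numeric data frame; values copied from the first rows of the preview above.
als_toy <- data.frame(
  Age_mean    = c(65, 48, 38, 63, 63, 36),
  Albumin_max = c(57, 45, 50, 47, 47, 51)
)

# One row per feature with mean, standard deviation, and median.
als_summary <- t(sapply(als_toy, function(col) {
  c(mean   = mean(col, na.rm = TRUE),
    sd     = sd(col, na.rm = TRUE),
    median = median(col, na.rm = TRUE))
}))
als_summary <- round(als_summary, 3)
print(als_summary)

# Save the printed table as plain text (temporary file used here as a placeholder).
out_file <- tempfile(fileext = ".txt")
writeLines(capture.output(print(als_summary)), out_file)
```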

2.3 Summary for knee_df as a total and per view

##        x               Y             View          
##  Min.   : 11.0   Min.   : 34.0   Length:8666       
##  1st Qu.: 95.0   1st Qu.:192.0   Class :character  
##  Median :200.0   Median :210.0   Mode  :character  
##  Mean   :224.4   Mean   :210.2                     
##  3rd Qu.:241.0   3rd Qu.:226.0                     
##  Max.   :642.0   Max.   :380.0
## knee_df$View: LB
##        x               Y             View          
##  Min.   :384.0   Min.   : 64.0   Length:924        
##  1st Qu.:426.0   1st Qu.:186.0   Class :character  
##  Median :444.0   Median :194.0   Mode  :character  
##  Mean   :443.0   Mean   :201.8                     
##  3rd Qu.:452.2   3rd Qu.:209.0                     
##  Max.   :498.0   Max.   :374.0                     
## ------------------------------------------------------------ 
## knee_df$View: LF
##        x               Y             View          
##  Min.   :164.0   Min.   : 49.0   Length:3369       
##  1st Qu.:200.0   1st Qu.:197.0   Class :character  
##  Median :211.0   Median :213.0   Mode  :character  
##  Mean   :211.4   Mean   :212.2                     
##  3rd Qu.:223.0   3rd Qu.:228.0                     
##  Max.   :287.0   Max.   :368.0                     
## ------------------------------------------------------------ 
## knee_df$View: RB
##        x               Y             View          
##  Min.   :520.0   Min.   : 34.0   Length:882        
##  1st Qu.:563.0   1st Qu.:186.0   Class :character  
##  Median :572.0   Median :194.0   Mode  :character  
##  Mean   :572.7   Mean   :201.4                     
##  3rd Qu.:590.0   3rd Qu.:209.0                     
##  Max.   :642.0   Max.   :366.0                     
## ------------------------------------------------------------ 
## knee_df$View: RF
##        x                Y             View          
##  Min.   : 11.00   Min.   : 57.0   Length:3491       
##  1st Qu.: 80.00   1st Qu.:198.0   Class :character  
##  Median : 91.00   Median :213.0   Mode  :character  
##  Mean   : 91.06   Mean   :212.6                     
##  3rd Qu.:104.50   3rd Qu.:227.0                     
##  Max.   :137.00   Max.   :380.0
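The overall and per-view summaries above have the shape of `summary()` applied once to the whole data frame and once per group through `by()`. A minimal sketch, assuming a small hypothetical data frame `knee_toy` (the RF rows come from the preview above; the LF rows are made up):

```r
knee_toy <- data.frame(
  x    = c(11, 20, 19, 170, 205, 230),
  Y    = c(73, 88, 73, 120, 210, 190),
  View = c("RF", "RF", "RF", "LF", "LF", "LF")
)

# Summary of the whole dataset.
print(summary(knee_toy))

# Summary of the coordinates within each view.
view_summary <- by(knee_toy[, c("x", "Y")], knee_toy$View, summary)
print(view_summary)
```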

Next, plot the first 10 features using histograms, box plots, and density plots. Then, finally, make a heatmap of all the features.

3 Problem 2.2 (Bivariate Relations Plots):

Use ALS case-study data and SOCR Knee Pain Data to explore some bivariate relations (e.g. bivariate plot, correlation, cross-tabulation, etc.). Use 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relations between temperature and time. [Hint: use geom_line or geom_bar]. Some sample code is included below.

First let’s look at ALS bivariate relationships

3.1 Student response


##              
##               (37,40.3] (40.3,43.7] (43.7,47] (47,50.3] (50.3,53.6] (53.6,57]
##   (17.9,30.6]         0           0         4        19           9         2
##   (30.6,43.2]         4          21        70       222          71         7
##   (43.2,55.8]        23          51       192       355          73        12
##   (55.8,68.4]        31          81       283       372          40         3
##   (68.4,81.1]        21          43        95        95           9         0
##              
##               (57,60.3] (60.3,63.6] (63.6,67] (67,70.3]
##   (17.9,30.6]         0           0         0         0
##   (30.6,43.2]         0           0         1         0
##   (43.2,55.8]         1           0         0         0
##   (55.8,68.4]         5           2         0         2
##   (68.4,81.1]         1           2         1         0
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  2223 
## 
##  
##                                  | cut(ALS_train$Glucose_max, 4) 
## as.factor(ALS_train$Gender_mean) | (4.13,11.5] | (11.5,18.9] | (18.9,26.3] | (26.3,33.7] |   Row Total | 
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
##                                1 |         779 |          24 |           2 |           1 |         806 | 
##                                  |       0.139 |       1.234 |       2.174 |       0.364 |             | 
##                                  |       0.967 |       0.030 |       0.002 |       0.001 |       0.363 | 
##                                  |       0.367 |       0.289 |       0.133 |       0.200 |             | 
##                                  |       0.350 |       0.011 |       0.001 |       0.000 |             | 
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
##                                2 |        1341 |          59 |          13 |           4 |        1417 | 
##                                  |       0.079 |       0.702 |       1.237 |       0.207 |             | 
##                                  |       0.946 |       0.042 |       0.009 |       0.003 |       0.637 | 
##                                  |       0.633 |       0.711 |       0.867 |       0.800 |             | 
##                                  |       0.603 |       0.027 |       0.006 |       0.002 |             | 
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
##                     Column Total |        2120 |          83 |          15 |           5 |        2223 | 
##                                  |       0.954 |       0.037 |       0.007 |       0.002 |             | 
## ---------------------------------|-------------|-------------|-------------|-------------|-------------|
## 
## 
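A cross-tabulation of a categorical variable against a binned continuous one can be sketched in base R with `cut()`, `table()`, and `prop.table()`. This is not the function that produced the output above; the vectors below are hypothetical.

```r
gender  <- c(1, 1, 2, 2, 2, 1, 2, 2)
glucose <- c(7.4, 6.8, 5.7, 5.1, 7.5, 5.6, 12.3, 9.9)  # hypothetical values

# Bin the continuous variable into equal-width intervals.
glucose_bin <- cut(glucose, breaks = 2)

# Counts per cell, then row proportions (the "N / Row Total" idea).
xt <- table(Gender = factor(gender), Glucose = glucose_bin)
print(xt)
print(prop.table(xt, margin = 1))
```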

Then, let’s look at knee data

## # A tibble: 4 x 2
##   View        c
##   <chr>   <dbl>
## 1 LB     0.154 
## 2 LF    -0.107 
## 3 RB    -0.116 
## 4 RF     0.0538
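A per-view correlation between x and Y, like the table above, can be sketched in base R by splitting the data on View. The data frame `knee_toy` is hypothetical, and the original table may have been produced with a different grouping tool.

```r
knee_toy <- data.frame(
  x    = c(11, 20, 19, 15, 170, 205, 230, 190),
  Y    = c(73, 88, 73, 65, 120, 210, 190, 150),
  View = rep(c("RF", "LF"), each = 4)
)

# Pearson correlation between x and Y within each view.
cor_by_view <- sapply(split(knee_toy, knee_toy$View),
                      function(d) cor(d$x, d$Y))
print(round(cor_by_view, 3))
```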

Now let’s take a look at the temperature data for the past century

3.2 Student response

There seems to be evidence that the temperature in Ann Arbor is trending upwards. This is most apparent in February and June, from the 1900s to the 1950s. Putting all months on the same scale, as in a bar plot, makes it much more difficult to discern any differences.

##       Year           Jan             Feb             Mar             Apr       
##  Min.   :1900   Min.   :11.40   Min.   :14.00   Min.   :24.70   Min.   :37.50  
##  1st Qu.:1929   1st Qu.:20.95   1st Qu.:22.35   1st Qu.:32.65   1st Qu.:45.15  
##  Median :1958   Median :24.20   Median :26.10   Median :35.30   Median :47.60  
##  Mean   :1958   Mean   :24.22   Mean   :25.45   Mean   :35.39   Mean   :47.55  
##  3rd Qu.:1986   3rd Qu.:27.65   3rd Qu.:28.90   3rd Qu.:38.15   3rd Qu.:50.15  
##  Max.   :2015   Max.   :35.00   Max.   :35.60   Max.   :50.70   Max.   :54.50  
##                 NA's   :1       NA's   :1       NA's   :1       NA's   :1      
##       May             Jun             Jul             Aug       
##  Min.   :49.40   Min.   :61.90   Min.   :67.80   Min.   :64.90  
##  1st Qu.:56.60   1st Qu.:66.45   1st Qu.:71.10   1st Qu.:68.85  
##  Median :58.70   Median :68.40   Median :72.60   Median :70.90  
##  Mean   :58.85   Mean   :68.22   Mean   :72.66   Mean   :70.74  
##  3rd Qu.:61.65   3rd Qu.:70.50   3rd Qu.:73.70   3rd Qu.:72.40  
##  Max.   :66.60   Max.   :74.30   Max.   :78.90   Max.   :76.20  
##  NA's   :1                                       NA's   :1      
##       Sep             Oct             Nov             Dec       
##  Min.   :54.80   Min.   :42.70   Min.   :32.90   Min.   :17.70  
##  1st Qu.:62.20   1st Qu.:50.20   1st Qu.:37.40   1st Qu.:25.55  
##  Median :63.80   Median :52.20   Median :39.80   Median :28.40  
##  Mean   :63.82   Mean   :52.29   Mean   :39.76   Mean   :28.39  
##  3rd Qu.:65.75   3rd Qu.:54.30   3rd Qu.:41.85   3rd Qu.:31.65  
##  Max.   :68.80   Max.   :62.10   Max.   :47.50   Max.   :36.80  
##  NA's   :1       NA's   :1       NA's   :1       NA's   :1
##   Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1 2015 26.3 14.4 34.9 49.0 64.2 68.0 71.2 70.2 68.7 53.9 39.8 28.4
## 2 2014 24.4 19.4 29.0 48.9 60.7 69.7 68.8 70.8 63.2 52.1 35.4 33.3
## 3 2013 22.7 26.1 33.3 46.0 63.1 68.5 72.9 70.2 64.6 53.2 37.6 26.7
## 4 2012 22.4 32.8 50.7 49.2 65.2 71.4 78.9 72.2 63.9 51.7 39.6 34.8
## 5 2011 15.3 24.2 33.1 45.5 58.1 68.7 78.7 70.8 62.3 52.2 44.8 34.1
## 6 2010 18.4 27.4 42.0 54.5 61.9 70.5 75.3 74.3 64.2 55.0 41.4 25.2

To reshape the data from wide to long format with reshape(), we need:

  1. a list of variable names that define the different times or metrics (varying): r colN

  2. the name we wish to give the variable containing these values in our long dataset (v.names); this would be Temps

  3. the name we wish to give the variable describing the different times or metrics (timevar); this would be Months

  4. the values this variable will have (times); this would be the colnames

  5. the end format for the data (direction); this would be wide to long
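The five arguments listed above map directly onto base R's `reshape()`. A minimal sketch, using three months from the 2013 to 2015 rows shown earlier (the names `temp_wide`, `month_cols`, and `temp_long` are my own):

```r
# Wide format: one row per year, one column per month.
temp_wide <- data.frame(
  Year = c(2015, 2014, 2013),
  Jan  = c(26.3, 24.4, 22.7),
  Feb  = c(14.4, 19.4, 26.1),
  Mar  = c(34.9, 29.0, 33.3)
)
month_cols <- setdiff(names(temp_wide), "Year")

temp_long <- reshape(
  temp_wide,
  varying   = month_cols,  # columns holding the repeated measurements
  v.names   = "Temps",     # name of the value column in long format
  timevar   = "Months",    # name of the column identifying the month
  times     = month_cols,  # values that the Months column will take
  idvar     = "Year",      # existing column identifying each row
  direction = "long"
)
rownames(temp_long) <- NULL
print(temp_long)
```

The long format has one row per year and month, which is the shape that plotting functions such as geom_line expect.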

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'
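One way to put a number on the trend in each month is the slope of a straight-line fit of temperature against year. The sketch below uses only the six most recent years shown in the preview above, which is far too few to support any conclusion; the straight-line fit is also an assumption, since the plots may have used a different smoother.

```r
temps <- data.frame(
  Year = 2015:2010,
  Feb  = c(14.4, 19.4, 26.1, 32.8, 24.2, 27.4),
  Jun  = c(68.0, 69.7, 68.5, 71.4, 68.7, 70.5)
)

# Slope (degrees per year) of a linear fit for each month.
slopes <- sapply(c("Feb", "Jun"), function(m) {
  fit <- lm(temps[[m]] ~ temps$Year)
  unname(coef(fit)[2])
})
print(slopes)
```

Running the same fit on the full 1900 to 2015 series, month by month, would give a fairer comparison than reading the slopes off the plots.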

4 Problem 2.3 (Missing Data)

Introduce (artificially) some missing data in the Knee Pain dataset, impute the missing values and examine the differences between the original, incomplete, and imputed datasets.

4.1 Student response

The plots show the histogram (in black) of the complete variable, the histogram (in blue) of the observed values, and the histogram (in red) of the imputed values. The imputations perform quite well, judging by how closely the blue trace follows the black histogram. With 5 chains for 15% missing data, the resulting values are approximately 1, which indicates a successful and reasonable imputation. However, I find it odd that all the chains yield very similar values for x and Y.

## null device 
##           1
##           chain:1 chain:2 chain:3 chain:4 chain:5
## x           0.001   0.001   0.001   0.002   0.002
## Y          -0.001  -0.002   0.004   0.000   0.007
## View        2.801   2.801   2.801   2.801   2.801
## missing_x   0.150   0.150   0.150   0.150   0.150
## missing_Y   0.150   0.150   0.150   0.150   0.150
##    mean_x    mean_Y      sd_x      sd_Y 
## 1.2812240 1.0978629 0.8844699 0.9565299
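The overall workflow (remove values at random, impute, compare against the complete data) can be sketched in base R. Note that this sketch swaps in simple mean imputation rather than the chain-based multiple imputation used above, and that the data are synthetic; reading the comparison as ratios of imputed to complete statistics is also my assumption.

```r
set.seed(1)
n <- 200
# Synthetic stand-in for the knee coordinates (hypothetical values).
knee_full <- data.frame(x = rnorm(n, 220, 150), Y = rnorm(n, 210, 35))

# Introduce 15% missing values in each column, completely at random.
knee_miss <- knee_full
knee_miss$x[sample(n, size = round(0.15 * n))] <- NA
knee_miss$Y[sample(n, size = round(0.15 * n))] <- NA

# Mean imputation: replace each NA with the observed column mean.
knee_imp <- knee_miss
for (v in c("x", "Y")) {
  knee_imp[[v]][is.na(knee_imp[[v]])] <- mean(knee_miss[[v]], na.rm = TRUE)
}

# Ratios of imputed to complete statistics; values near 1 mean little distortion.
ratios <- c(
  mean_x = mean(knee_imp$x) / mean(knee_full$x),
  mean_Y = mean(knee_imp$Y) / mean(knee_full$Y),
  sd_x   = sd(knee_imp$x) / sd(knee_full$x),
  sd_Y   = sd(knee_imp$Y) / sd(knee_full$Y)
)
print(round(ratios, 3))
```

Mean imputation tends to shrink the standard deviation, which is one reason to prefer a model-based method for real analyses.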

5 Problem 2.4 (Surface Plots)

Generate a surface plot for the (RF) Knee Pain data illustrating the 2D distribution of locations of the patient reported knee pain (use plot_ly and kernel density estimation).

5.1 Student response

Surface plots are an interesting way to visualize the data. I think they would not be advantageous for time-series data or very dense datasets, since peaks may be missed. Humans are also bad at assessing 3D representations, so a surface plot may be more confusing than a simple heatmap. With this colormap, the peaks are yellow and the troughs are blue. I also found that plot_ly can be called more simply by passing “z=kd$z” directly instead of using with(), which may add a layer of complexity. Experimenting with colors may also be necessary, since some colormaps (like jet) do not represent the dataset well.
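A kernel density surface of this kind can be sketched with `MASS::kde2d()` followed by `plot_ly()`. The coordinates below are synthetic stand-ins for the RF view, and the plotting step only runs if the plotly package happens to be installed. To my understanding plotly reads the rows of z along the y axis, so the sketch transposes the density matrix; treat that detail as an assumption to verify against the rendered plot.

```r
library(MASS)

set.seed(1)
# Synthetic stand-in for the RF pain locations (hypothetical values).
rf_x <- rnorm(300, mean = 91, sd = 18)
rf_y <- rnorm(300, mean = 212, sd = 25)

# 2D kernel density estimate on a 50 x 50 grid.
kd <- kde2d(rf_x, rf_y, n = 50)
print(dim(kd$z))

# Build the surface only when plotly is available.
if (requireNamespace("plotly", quietly = TRUE)) {
  # kde2d() returns z with rows indexed by x, hence the transpose.
  p <- plotly::plot_ly(x = kd$x, y = kd$y, z = t(kd$z), type = "surface")
  # Printing p in an interactive session displays the surface.
}
```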

6 Problem 2.5 (Sample-Size Rebalancing)

Rebalance the groups of ALS (training data) patients according to Age > 50 vs. Age ≤ 50, based on synthetic minority oversampling (SMOTE), to ensure approximately equal cohort sizes.

6.1 Student response

It took a bit of playing around with the inputs for SMOTE and ubBalance. In particular, the perc.over and perc.under arguments, as well as their specific proportions, still elude me. I found myself experimenting with the proportions until the two groups had equal numbers of samples. Specifically, it seems that over-sampling the minority class by 100% (perc.over) and setting perc.under to 200% is optimal to obtain the same N.

SMOTE is preferable to downsampling the majority class, since downsampling simply decreases our N. Instead, SMOTE artificially generates new examples of the minority class using the nearest neighbors of these cases. It can also undersample the majority class, leading to a more balanced dataset. In this case, it instead oversampled the majority class by approximately 10% (from 1435 to 1576). Getting ubBalance to work required going into the source code and finding out that we needed to input a positive parameter for the minority class.

## Proportion of positives after ubSMOTE : 50 % of 3152 observations
## 
## Young   Old 
##  1435   788
## 
## Young   Old 
##  1576  1576
## 
## Young   Old 
##  1576  1576
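The counts above are consistent with the following reading of the two percentages, which I state as an assumption rather than as the documented behaviour of ubSMOTE: perc.over = 100 creates one synthetic case per minority case (788 new, so 1576 in total), and perc.under = 200 draws two majority cases per synthetic case (2 × 788 = 1576). The sketch below is a hand-written, simplified SMOTE in base R on synthetic data, not the ubBalance implementation; `smote_sketch` and all variable names are hypothetical.

```r
set.seed(1)
# Synthetic two-feature data: 55 majority and 30 minority cases.
majority <- matrix(rnorm(110, mean = 0), ncol = 2)
minority <- matrix(rnorm(60, mean = 2), ncol = 2)

# Create n_new synthetic minority cases by interpolating between a
# minority case and one of its k nearest minority neighbours.
smote_sketch <- function(minority, n_new, k = 5) {
  d <- as.matrix(dist(minority))  # pairwise distances within the minority class
  diag(d) <- Inf                  # a case is not its own neighbour
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(minority))
  for (i in seq_len(n_new)) {
    base <- ((i - 1) %% nrow(minority)) + 1  # cycle through minority cases
    nn   <- order(d[base, ])[seq_len(k)]     # indices of the k nearest neighbours
    mate <- nn[sample.int(k, 1)]             # pick one neighbour at random
    gap  <- runif(1)                         # position along the connecting segment
    synth[i, ] <- minority[base, ] + gap * (minority[mate, ] - minority[base, ])
  }
  synth
}

perc_over  <- 100
perc_under <- 200
n_new      <- nrow(minority) * perc_over / 100
synthetic  <- smote_sketch(minority, n_new)

# Majority cases kept: perc_under / 100 per synthetic case,
# sampled with replacement when more are requested than exist.
n_major_keep <- n_new * perc_under / 100
major_idx <- sample(nrow(majority), n_major_keep,
                    replace = n_major_keep > nrow(majority))

balanced_counts <- c(majority = length(major_idx),
                     minority = nrow(minority) + nrow(synthetic))
print(balanced_counts)
```

Because 60 majority cases are requested from only 55, the majority class is sampled with replacement, which mirrors the roughly 10% growth of the majority class seen in the counts above.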